fix: prevent self-update re-exec loop by TerrifiedBug · Pull Request #16 · TerrifiedBug/vectorflow

TerrifiedBug · 2026-03-05T19:26:47Z

Summary

Skip update when agent is already running the target version — prevents infinite syscall.Exec loop where the new process polls, gets the same pendingAction, and re-execs again before a heartbeat can clear it
Shut down all Vector pipelines before syscall.Exec to prevent orphaned processes and port binding conflicts

Context

After a successful update, the agent re-execs via syscall.Exec. The new process starts, polls config immediately, and gets the same pendingAction because no heartbeat was sent to clear it. This creates a tight loop of download → verify → replace → re-exec → repeat every ~1 second.

Test plan

Trigger a binary update from the fleet UI
Verify agent updates once and stops (no repeated "self-update requested" logs)
Verify Vector pipelines restart cleanly without "Address in use" errors
Verify the agent correctly reports its version after update

- Skip update when already running target version (breaks re-exec loop) - Shut down all pipelines before syscall.Exec to prevent port conflicts

greptile-apps · 2026-03-05T19:29:30Z

Greptile Summary

This PR fixes two related issues in the agent's self-update flow: an infinite syscall.Exec loop caused by the new process immediately re-triggering the same pendingAction, and orphaned Vector child processes left behind during re-exec.

Changes:

agent.go: Adds a version guard — if action.TargetVersion == Version, the update is skipped and execution falls through to the next heartbeat, which clears the pendingAction on the server side. This is the primary fix for the re-exec loop.
updater.go: Calls supervisor.ShutdownAll() between the atomic binary rename and syscall.Exec, ensuring all Vector processes are gracefully terminated before the new agent image takes over.

Correctness: The version guard is well-placed — it runs before the failedUpdateVersion check and before any I/O, so the fast path is cheap. The ShutdownAll() call runs goroutines in parallel and already has a per-process 30 s SIGTERM → SIGKILL timeout in stopProcess, so the worst-case blocking window is bounded at 30 s regardless of the number of running pipelines.

Confidence Score: 4/5

This PR is safe to merge; both changes are minimal, well-scoped, and correctly address the described loop and orphan-process issues.
The version-equality check is the correct fix for the re-exec loop, and ShutdownAll() cleanly prevents port conflicts. The one minor concern — ShutdownAll() can hold the pre-exec window open for up to 30 s after the binary is already replaced — is a bounded, acceptable trade-off but deserves a comment for future maintainers.
No files require special attention beyond the style note on updater.go line 90.

Important Files Changed

Filename	Overview
agent/internal/agent/agent.go	Adds an early-exit guard in `handlePendingAction` that skips the self-update when the agent is already running `action.TargetVersion`, preventing the post-re-exec polling loop.
agent/internal/agent/updater.go	Calls `supervisor.ShutdownAll()` between the atomic binary rename and `syscall.Exec` to prevent orphaned Vector processes. ShutdownAll can block up to 30 s (SIGTERM timeout) after the binary is already replaced, which is an acceptable trade-off but worth documenting.

Sequence Diagram

sequenceDiagram
    participant A1 as Agent v1
    participant S as Server
    participant FS as Filesystem
    participant A2 as Agent v2

    A1->>S: pollAndApply()
    S-->>A1: pendingAction {self_update, targetVersion: v2}
    Note over A1: Version check: "v1" ≠ "v2" → proceed
    A1->>FS: Download + verify binary
    A1->>FS: os.Rename(tmp → execPath)
    A1->>A1: supervisor.ShutdownAll() (≤30s)
    A1->>A2: syscall.Exec (process replaced)

    A2->>S: pollAndApply()
    S-->>A2: pendingAction {self_update, targetVersion: v2}
    Note over A2: Version check: "v2" == "v2" → SKIP (loop prevented)
    A2->>S: sendHeartbeat()
    S-->>A2: pendingAction cleared

_{Last reviewed commit: b4b3fae}

greptile-apps · 2026-03-05T19:29:33Z

agent/internal/agent/updater.go

+	// Shut down all pipelines before re-exec to avoid orphaned Vector processes
+	// and port conflicts when the new agent starts them again.
+	slog.Info("stopping all pipelines before re-exec")
+	a.supervisor.ShutdownAll()


ShutdownAll can block up to 30 seconds post-rename

ShutdownAll() sends SIGTERM to every Vector process and then waits up to 30 seconds for each one (they run in parallel, so worst-case is 30 s total). This happens after the new binary has already been atomically renamed into place at execPath. If the agent process is killed during this window (OOM, stray signal, hardware reset), the new binary sits in place but the agent isn't running — there's nothing to auto-restart it.

The window is bounded and the parallel shutdown keeps it at one timeout cycle regardless of pipeline count, so this is a reasonable trade-off. It's worth a comment calling out the intentional 30 s window, and potentially wrapping ShutdownAll in a context with a hard deadline to make the bound explicit:

// ShutdownAll waits for all Vector processes to exit (up to 30 s each, // running in parallel) before we re-exec. The new binary is already in // place at execPath; if this process is killed here the agent will need // to be restarted manually. slog.Info("stopping all pipelines before re-exec") a.supervisor.ShutdownAll()

Prompt To Fix With AI

This is a comment left during a code review. Path: agent/internal/agent/updater.go Line: 90 Comment: **`ShutdownAll` can block up to 30 seconds post-rename** `ShutdownAll()` sends `SIGTERM` to every Vector process and then waits up to **30 seconds** for each one (they run in parallel, so worst-case is 30 s total). This happens *after* the new binary has already been atomically renamed into place at `execPath`. If the agent process is killed during this window (OOM, stray signal, hardware reset), the new binary sits in place but the agent isn't running — there's nothing to auto-restart it. The window is bounded and the parallel shutdown keeps it at one timeout cycle regardless of pipeline count, so this is a reasonable trade-off. It's worth a comment calling out the intentional 30 s window, and potentially wrapping `ShutdownAll` in a context with a hard deadline to make the bound explicit: ```go // ShutdownAll waits for all Vector processes to exit (up to 30 s each, // running in parallel) before we re-exec. The new binary is already in // place at execPath; if this process is killed here the agent will need // to be restarted manually. slog.Info("stopping all pipelines before re-exec") a.supervisor.ShutdownAll() ``` How can I resolve this? If you propose a fix, please make it concise.

fix: prevent self-update re-exec loop and orphaned Vector processes

b4b3fae

- Skip update when already running target version (breaks re-exec loop) - Shut down all pipelines before syscall.Exec to prevent port conflicts

github-actions bot added fix agent and removed fix labels Mar 5, 2026

greptile-apps bot reviewed Mar 5, 2026

View reviewed changes

TerrifiedBug merged commit acc199e into main Mar 5, 2026
12 checks passed

TerrifiedBug deleted the fix/update-loop-v2 branch March 5, 2026 19:47

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: prevent self-update re-exec loop#16

fix: prevent self-update re-exec loop#16
TerrifiedBug merged 1 commit intomainfrom
fix/update-loop-v2

TerrifiedBug commented Mar 5, 2026

Uh oh!

greptile-apps bot commented Mar 5, 2026

Uh oh!

greptile-apps bot Mar 5, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

TerrifiedBug commented Mar 5, 2026

Summary

Context

Test plan

Uh oh!

greptile-apps bot commented Mar 5, 2026

Greptile Summary

Confidence Score: 4/5

Important Files Changed

Sequence Diagram

Uh oh!

greptile-apps bot Mar 5, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant